Content Based Web Sampling

Authors

  • Yini Bao
  • Erwin M. Bakker
Abstract

Web characterization methods have been studied for many years. Most of these methods focus on text-based web content. Some analyze the contents of a web page through its HTML code, hyperlinks, and/or DOM structure. A web page is seldom characterized by its visual appearance. A good reason to also consider visual appearance is that humans initially perceive a web page as an image, and only then look in detail at its text and further pictorial content. Hence it is a more natural way to analyze and classify the contents of web pages. Moreover, as more and more new web technologies have appeared in recent years (JavaScript, Flash, and AJAX), analyzing the HTML code of a web page seems to be meaningless without actually parsing and interpreting it. This poses new challenges for textual web page characterization and has an impact on the efficiency of indexing techniques. Thus, by combining established text classification methods with our novel (visual) content-based methods, we offer a more promising way to characterize the web. The main idea of the project is to take a snapshot of each page and use image classification methods to categorize the pages.
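The snapshot-then-classify idea can be illustrated with a small sketch. The abstract does not say which image features or classifier the authors use, so everything below is an assumption made for illustration: a coarse RGB color histogram as the feature, a nearest-centroid rule as the classifier, and hypothetical helper names (`color_histogram`, `train_centroids`, `classify`). A snapshot is represented simply as a list of RGB pixel tuples; capturing the snapshot from a browser is out of scope here.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # one RGB pixel, each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Normalized coarse RGB histogram with bins**3 cells (assumed feature)."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        # Clamp so that channel values near 255 stay inside the last bin.
        ri = min(r // step, bins - 1)
        gi = min(g // step, bins - 1)
        bi = min(b // step, bins - 1)
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = len(pixels) or 1
    return [h / total for h in hist]


def train_centroids(
    labeled: Sequence[Tuple[str, Sequence[Pixel]]]
) -> Dict[str, List[float]]:
    """Average the histograms of all snapshots that share a label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for label, pixels in labeled:
        feat = color_histogram(pixels)
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def classify(pixels: Sequence[Pixel], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is closest in Euclidean distance."""
    feat = color_histogram(pixels)

    def dist(centroid: List[float]) -> float:
        return sqrt(sum((a - b) ** 2 for a, b in zip(feat, centroid)))

    return min(centroids, key=lambda lab: dist(centroids[lab]))


# Toy "snapshots": a uniformly dark page and a uniformly light page.
training = [
    ("dark", [(10, 10, 10)] * 100),
    ("light", [(240, 240, 240)] * 100),
]
centroids = train_centroids(training)
print(classify([(20, 20, 20)] * 50, centroids))
```

A color histogram ignores layout entirely, so a realistic system would likely need richer visual features; the sketch only shows how a page image can be reduced to a feature vector and assigned to a category.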

Similar resources

Investigating Healthcare Personnel’s Satisfaction with Quality of Web-based Learning in Teaching Preventive Behaviors of Hepatitis B Virus Infection

Introduction: Acceptance and implementation of preventive behaviors through new methods by healthcare personnel are of great importance. The aim of this study was to investigate healthcare personnel’s satisfaction with quality of web-based learning in teaching preventive behaviors of hepatitis B virus infection. Methods: This descriptive study was conducted on 120 healthcare employees in Tehran ...

Query-Based Sampling: Can we do Better than Random?

Many servers on the web offer content that is only accessible via a search interface. These are part of the deep web. Using conventional crawling to index the content of these remote servers is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling requiring no cooperation beyond a basic search interface. In this approach, conventionally, random qu...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

The Application of Sampling to the Design of Structural Analysis Web Crawlers

The growth of the World Wide Web (WWW) has seen it evolve into a rich information resource. It is constantly traversed with the aid of crawlers so as to harvest web content. When collecting data, crawlers have the potential of causing service denial to web servers. This paper proposes the application of sampling as a selection strategy in the design of structural analysis web crawlers. This has...

An Anatomy of a Large-scale Image Search Engine

As the World-Wide Web moves rapidly from text-based towards multimedia content, and requires more personalized access, we deem existing infrastructures inadequate. In this paper, we present critical components for enabling effective searches in Web-based or large-scale image libraries. In particular, we propose a perception-based search component, which can learn users’ subjective query concept...

Promoting Music Sampling by Semantic Web-enhanced DRM tools

The digital revolution has provided new incentives and facilities for content creation. Music has particularly benefited from this opportunity and creative processes like music sampling currently constitute their techniques in digital technologies. However, the digitisation has also carried some shortcomings, like the proliferation of DRM tools that menace traditional rights and usages. This pa...

Journal:
  • JDCTA

Volume: 4

Publication date: 2010